Using Synthetic Data for RAG Setup

As a Machine Learning Engineer, you often face the challenge of having too little labeled data. When you realize this, the typical response is to spend months collecting and labeling data before you can start developing a solution.

However, with the introduction of Large Language Models (LLMs), the process has changed. Now, you can quickly test ideas or develop AI features using the generalization abilities of LLMs. If your initial tests show promise, you can then move into the traditional development phase.

The Shift in AI-Powered Products

One of the new methods gaining traction is called Retrieval Augmented Generation (RAG). This approach is especially useful for tasks that require specific knowledge beyond what the model can provide on its own. RAG combines a system that retrieves relevant information with a text generator. For more details, check the relevant section in the guide.

A crucial part of RAG is the retrieval model, which finds relevant documents and passes them to the LLM for further processing. The success of your product or feature often depends on how well this retrieval model performs. General-purpose retrievers usually work well out of the box, but their effectiveness can drop in languages other than English or in specialized domains.

For example, if you want to create a chatbot that answers questions about Czech laws, or a tax assistant for the Indian market (as shown in OpenAI's GPT-4 presentation), you might discover that the retrieval model struggles to find the most relevant documents. This can negatively affect the overall quality of your system.
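
To make the retrieval step concrete, here is a minimal sketch of the retrieve-then-generate flow. It is only an illustration: the bag-of-words `embed` function stands in for a real neural retrieval model, and the names `embed`, `cosine`, `retrieve`, and `build_rag_prompt` are hypothetical helpers, not part of any library.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Toy stand-in for a retrieval model: a bag-of-words term count.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    # Rank all documents by similarity to the query and keep the best top_k.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def build_rag_prompt(query: str, documents: List[str], top_k: int = 2) -> str:
    # The retrieved documents become the context the LLM answers from.
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

If the retriever ranks the wrong documents first, the LLM receives the wrong context, which is why retrieval quality in a niche domain or language matters so much.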

A Solution: Generating Synthetic Data

There's a growing trend of using existing LLMs to create synthetic data for training new LLMs, retrieval models, and other types of models. This process can be viewed as distilling LLMs into standard-sized encoders via prompt-based query generation. Although the distillation itself requires significant computational resources, the resulting encoder is much cheaper to run than the LLM and can perform considerably better, particularly for low-resource languages or niche topics.

In this guide, we will use advanced text generation models like ChatGPT and GPT-4 to create large amounts of synthetic content based on your instructions. Research by Dai et al. (2022) shows that with just eight manually labeled examples and a large set of unlabeled documents (like parsed laws), you can achieve nearly state-of-the-art performance. This means synthetic data can be a powerful tool for training specialized retrieval models when labeled data is scarce.

Generating Domain-Specific Datasets

To leverage LLMs effectively, you'll need to provide a brief description of the retrieval task and label a few examples manually. It’s essential to recognize that different retrieval tasks have different search intents, which means the definition of "relevance" changes from task to task. For instance, one argument retrieval task might look for supporting arguments, while another might require counter-arguments.

Here’s a simple example (written in English for clarity, but you can use any language). 

Prompt Example:

- Task: Find a counter-argument for the given argument.
- Argument #1: {insert argument here}
- Query for Argument #1: {insert your query here}
- Argument #2: {insert argument here}
- Query for Argument #2: {insert your query here}
- Argument #N: Even if fines are based on income, this won't achieve equal impact because other factors need to be considered, like family responsibilities and overall wealth.
- Query for Argument #N:

Output:

- Query: punishment house would make fines relative income

Generally, your prompt will look like this:

Prompt Structure:

```
(prompt, doc(document_1), query(query_1), ..., doc(document_k), query(query_k), doc(d))
```

Here, `doc` and `query` are task-specific templates that format each document and query (in the example above, "Argument #1: ..." and "Query for Argument #1: ..."), `prompt` is the instruction for ChatGPT/GPT-4 to follow, and `d` is the new document for which a query should be generated. The last document and its generated query form a synthetic training pair that will be used to further train your local model. This method is beneficial when you have a target set of documents but limited annotated query-document pairs for your new task.
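
The prompt structure can be sketched as a small helper. This is a minimal illustration, assuming the "Argument" / "Query for Argument" labels from the example above; the function name `build_prompt` and its parameters are hypothetical, and sending the resulting string to an LLM is left out.

```python
from typing import List, Tuple


def build_prompt(
    instruction: str,
    examples: List[Tuple[str, str]],
    target_document: str,
    doc_label: str = "Argument",
    query_label: str = "Query for Argument",
) -> str:
    """Assemble (prompt, doc(d_1), query(q_1), ..., doc(d_k), query(q_k), doc(d))."""
    lines = [f"Task: {instruction}"]
    # Each labeled example contributes one document line and one query line.
    for i, (document, query) in enumerate(examples, start=1):
        lines.append(f"{doc_label} #{i}: {document}")
        lines.append(f"{query_label} #{i}: {query}")
    # The target document comes last, with its query left open for the LLM.
    n = len(examples) + 1
    lines.append(f"{doc_label} #{n}: {target_document}")
    lines.append(f"{query_label} #{n}:")
    return "\n".join(lines)
```

The text the LLM returns for the open final line is the synthetic query; paired with `target_document`, it becomes one training example for the retrieval model.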

Overview of the Data Generation Process

It’s important to handle the manual labeling of examples carefully. Instead of just a few, it’s better to prepare more examples (like 20) and randomly select a few (2-8) to use in your prompt. This helps diversify the generated data without requiring too much extra annotation time. However, your examples should be well-formatted and specific regarding things like query length and tone. The better your examples and instructions are, the more useful your synthetic data will be for training the retrieval model. Poor-quality examples can hurt the performance of the trained model.
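
The sampling step described above can be sketched as follows. The helper name `sample_examples` and its defaults (2 to 8 examples) are illustrative assumptions that follow the numbers in the text, not a prescribed implementation.

```python
import random
from typing import List, Optional, Tuple


def sample_examples(
    pool: List[Tuple[str, str]],
    min_k: int = 2,
    max_k: int = 8,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, str]]:
    """Pick a random number of distinct (document, query) examples from the pool."""
    rng = rng or random.Random()
    upper = min(max_k, len(pool))
    if upper < min_k:
        raise ValueError("pool has fewer examples than min_k")
    # Vary both how many examples are used and which ones, so each
    # prompt differs a little from the previous one.
    k = rng.randint(min_k, upper)
    return rng.sample(pool, k)


# Usage: a pool of about 20 hand-labeled pairs, resampled for every target document.
pool = [(f"document {i}", f"query {i}") for i in range(20)]
few_shot = sample_examples(pool, rng=random.Random(0))
```

Drawing a fresh subset for every target document means the model does not see the same few demonstrations each time, which is where the extra diversity comes from.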

In most cases, using a more affordable model like ChatGPT will be sufficient, especially for specialized domains and languages other than English. For example, a prompt with 4-5 examples usually takes about 700 tokens (assuming each passage is about 128 tokens long), and the generated query adds roughly 25 tokens. Therefore, generating synthetic data for 50,000 documents for training your local model would cost around $55; the exact figure depends on the API prices at the time you run it. This is much cheaper than gathering 10,000 examples manually, which could take a month and cost over a thousand dollars. With the techniques you’ve learned, you could see significant improvements in just a few days!
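
The estimate can be checked with a few lines of arithmetic. The per-token prices below are assumptions: $0.0015 per 1,000 prompt tokens and $0.002 per 1,000 generated tokens are values that reproduce the $55 figure, so replace them with your provider's current price list. The function name `generation_cost_usd` is hypothetical.

```python
def generation_cost_usd(
    num_documents: int,
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Estimate the API cost of generating one query per document."""
    input_cost = num_documents * prompt_tokens / 1000 * input_price_per_1k
    output_cost = num_documents * completion_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


cost = generation_cost_usd(
    num_documents=50_000,
    prompt_tokens=700,          # 4-5 examples of about 128 tokens each, plus instruction
    completion_tokens=25,       # one short generated query
    input_price_per_1k=0.0015,  # assumed price, not taken from the text
    output_price_per_1k=0.002,  # assumed price, not taken from the text
)
print(f"${cost:.2f}")  # prints $55.00
```

Almost all of the cost ($52.50 of the $55 under these assumptions) comes from the prompt tokens, so trimming passage length or the number of examples per prompt is the most direct way to lower it.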

